ABSTRACT

We analyze the structure and evolution of discussion cas- 
cades in four popular websites: Slashdot, Barrapunto, Me- 
neame and Wikipedia. Despite the big heterogeneities be- 
tween these sites, a preferential attachment (PA) model with 
bias to the root can capture the temporal evolution of the ob- 
served trees and many of their statistical properties, namely, 
probability distributions of the branching factors (degrees), 
subtree sizes and certain correlations. The parameters of 
the model are learned efficiently using a novel maximum
likelihood estimation scheme for PA and provide a figura- 
tive interpretation about the communication habits and the 
resulting discussion cascades on the four different websites. 

Categories and Subject Descriptors 

J.4 [Computer Applications]: Social and Behavioral Sciences - Sociology;
G.2.2 [Mathematics of Computing]: Graph Theory - Network problems, Trees

General Terms 

measurement, algorithms, human factors 

Keywords 

discussion cascades, threads, conversations, preferential at- 
tachment, maximum likelihood, Slashdot, Wikipedia 

1. INTRODUCTION 

Human communication patterns on the Internet are char- 
acterized by transient responses to social events. Examples 
of such phenomena are the discussion threads generated in 
news aggregators, the propagation of massively circulated 
Internet chain letters, or the synthesis of articles in collabo- 
rative web-based spaces such as Wikipedia. 

These responses can be regarded as tree-like cascades of 
activity generated from an underlying social network. Typically,
a trigger event, or a small set of initiators, generates a
chain reaction which may catch the attention of other users
who end up participating in the cascade (see Figure 1 for
examples) and attract even more users. Since these cascades
of comments are a direct consequence of the information
flow in a social system, understanding the mechanisms
and patterns which govern them plays a fundamental role
in contexts like the spreading of technological innovations [23],
the diffusion of news and opinion [11, 20], social influence [1] or
collective problem-solving [15].

Although information cascades have been extensively analyzed
for particular domains, such as blogs [11, 20], chain
letters [21], Flickr [5], Twitter [17] or page diffusion on
Facebook [24], the cascades under consideration in those studies
rarely involve elaborate discussions or complex exchanges
of opinions; generally, a small piece of information is just
forwarded from an individual to its direct neighbors. To the
best of our knowledge, with the exception of [16], no previous
work exists on modeling the evolution and structure of
long discussion-based cascades.

Here, as in [16], we consider several websites where the
associated (discussion) cascades contain a high level of
interaction. We analyze for the first time the cascades of the
popular news aggregator Slashdot, Barrapunto (a Spanish
version of Slashdot), Meneame (a Spanish Digg clone)
and the English Wikipedia. As the reader may notice, these
datasets are quite heterogeneous. For instance, although
posts from both Slashdot and Meneame correspond to popular
news which rely on broadcast events, Slashdot contains
rich and very extensive comments, which are less frequent in
Meneame. The cascades found in Wikipedia, on the other
hand, represent a collaborative effort towards a well-defined
goal: to produce a free, reliable article.

In this study we address the following questions: What
are the statistical patterns that determine the structure of 
such cascades and their evolution? Can these patterns be 
largely determined regardless of semantic information us- 
ing a simple parametric model? Can the parameterization 
corresponding to a given website provide a global character- 
ization for it? 

We first provide a global analysis of the cascade behavior 
in the four mentioned websites. Among other results, we find 
that the sizes of the cascades typically have a clearly defined
scale, which seems to contradict the recent results of [16].
Our analysis also highlights the importance of repetitive 
user participation in relation to other types of cascades and 
their impact on the entire social network. We also present a 
growth model for discussion cascades which is validated in 
the four datasets. Our approach is based on a simple model
of preferential attachment (PA) [2], where new contributions
in the cascade tree are linked to existing contributions with
a probability which depends on their popularity (degree).

Figure 1: Examples of real discussion cascades: (a) Slashdot, (b) Barrapunto, (c) Meneame, (d) Wikipedia.

Two key ingredients characterize our approach: First, we 
account for a certain bias favoring the root, or event ini- 
tiator. In this way, we are able to capture the different 
processes governing the global (direct reactions) and the lo- 
calized responses of the system. Second, we use a likelihood 
method particularly developed for this study which allows 
an efficient estimation of the model parameters which con- 
siders the entire generative model. The method is applicable 
not only for the data considered here but for a more general 
class of growing graphs. Here we are only interested in the
stochastic process which generates the cascade. We do not
model network dynamics or a termination criterion for the
cascades. Such a model could be built on top of our current
model, as is done for example in [16].

In the next section, we explain the proposed model and
how we estimate its parameters. Section 3 introduces the
datasets and provides a global analysis of their main
characteristics. In Section 4 we explain the main results
and give an interpretation of the parameters of the model.
Finally, we describe related work in Section 5 and discuss
the results in Section 6. In the Appendix we explain some
aspects of the likelihood approach which are important for
the estimation of parameters.



2. GROWING TREE MODELS FOR DISCUSSION CASCADES

We model a discussion cascade as a growing network in
which nodes correspond to comments and the initial node
corresponds to the post (a news article, for instance). New
nodes are added sequentially, one at each discrete time-step. Our
model is based on the original PA model, to which we add a
bias to the first node. Since each new node adds only one
new link to the existing graph, the resulting network is a
tree. We also assume that the total number of nodes $N$ is
known. It is convenient to represent the cascade compactly
as a vector of parent nodes $\pi$, where $\pi_t$ denotes the parent
of the node $t+1$ added at time-step $t$.



We are interested in the probability that node $k$ is the
parent $\pi_t$ given the past history $\pi_{(1:t-1)}$, that is
$p(\pi_t = k \mid \pi_{(1:t-1)})$, for $t > 1$, $k \in \{1, \dots, t\}$
and initial vector $\pi_1 = (1)$ (see footnote 1). Note that by
construction, $\pi_t \le t$, $\forall t$.

At time-step $t$, we relate the popularity of a node $k$ with
its number of links (degree $d_{k,t}$) before node $t+1$ is added
in the following way:

$$
d_{k,t}(\pi_{(1:t-1)}) =
\begin{cases}
1 + \sum_{m=2}^{t-1} \delta_{k\pi_m} & \text{for } k \in \{1, \dots, t\} \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
$$

where $\delta$ is the Kronecker delta function. In the following,
we omit the explicit dependence on $\pi_{(1:t-1)}$, so that
$d_{k,t} \equiv d_{k,t}(\pi_{(1:t-1)})$.
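The degree definition can be illustrated with a minimal Python sketch. It assumes Equation (1) reads $d_{k,t} = 1 + \sum_{m=2}^{t-1} \delta_{k\pi_m}$, so that every node starts with degree 1 (the root's link to node 2, or a non-root node's link to its parent). The function name `degrees` and the example parent vector are illustrative and not taken from the paper.

```python
def degrees(parents):
    """Return [d_{1,t}, ..., d_{t,t}] for parents = [pi_1, ..., pi_{t-1}].

    parents[m - 1] is the (1-indexed) parent of node m + 1, so parents[0] == 1.
    """
    t = len(parents) + 1            # number of nodes currently in the tree
    deg = [1] * t                   # deg[k - 1] holds d_{k,t}; every node starts at 1
    for parent in parents[1:]:      # pi_1 is already counted in the initial 1
        deg[parent - 1] += 1
    return deg


# Hypothetical cascade with six nodes: nodes 2, 3 and 5 reply to the root,
# node 4 replies to node 2 and node 6 replies to node 3.
print(degrees([1, 1, 2, 1, 3]))
```

With this convention the degree of the root equals its number of direct replies, while the degree of any other node equals one plus its number of replies.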




Figure 2: Small example: at time-step 9, node number
10 is added to the cascade. With probability
proportional to $(\beta d_1)^{\alpha_1}$ it is added to the root node
(initiator) and with probability proportional to $d_k^{\alpha_c}$
to one of the non-root nodes. Bottom right shows
the corresponding parent vector $\pi$ (see text for definitions).

The PA model attaches new nodes to node k with prob- 
ability proportional to its popularity. See Figure [2] for an 
illustration. For completeness, we consider two models: a 
simple PA model without bias to the root and a model which 
differentiates between the root node and the rest. 



^1 At time $t = 0$ we have $\pi_0 = ()$ and, for all trees, $p(\pi_1 = k) = 1$
for $k = 1$ and $0$ otherwise, i.e. $\pi_1 = (1)$ always.

In the general PA model, the probability of attaching the
node $t+1$ to node $k$ at time-step $t > 1$ is parameterized using
a linear term $\beta_k$ and an exponent $\alpha_k$ for each of the nodes:

$$
p(\pi_t = k \mid \pi_{(1:t-1)}) = \frac{(\beta_k d_{k,t})^{\alpha_k}}{Z_t},
\qquad Z_t = \sum_{l=1}^{t} (\beta_l d_{l,t})^{\alpha_l}.
\qquad (2)
$$

Model without bias: If we set $\alpha_k = \alpha$ and $\beta_k = 1$ for
$k \in \{1, \dots, t\}$, we recover an important generalization of
Barabási's PA model, where the probability of attachment
to a node goes as some general power $\alpha$ of the degree [13, 18].
For $\alpha = 1$, linear preferential attachment is recovered.
In this case, nodes have power-law distributed degrees.
For $\alpha < 1$, or sublinear PA, the degrees are distributed
according to a stretched exponential. For $\alpha > 1$ there is a
"condensation" phenomenon, in which a single node gets a
finite fraction of all the connections in the network [18].
Model with bias: Consider the following parameterization:

$$
\alpha_k =
\begin{cases}
\alpha_1 & \text{for } k = 1 \\
\alpha_c & \text{for } k \in \{2, \dots, t\}
\end{cases}
\qquad
\beta_k =
\begin{cases}
\beta & \text{for } k = 1 \\
1 & \text{for } k \in \{2, \dots, t\}
\end{cases}
\qquad (3)
$$

In this case, $\alpha_1$ and $\alpha_c$ are the exponents of the PA processes
governing the root and the non-root nodes respectively. $\beta$
can be regarded as an additional degree of freedom weighting
the root of the tree. In Section 4.4 we discuss the
interpretability of these parameters.
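The growth process of the model with bias can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the root is chosen with weight $(\beta d_1)^{\alpha_1}$ and any non-root node $k$ with weight $d_k^{\alpha_c}$, normalized over all existing nodes, and that $\beta > 0$. The names `grow_cascade` and `attachment_weights` are hypothetical.

```python
import random


def attachment_weights(deg, alpha_1, alpha_c, beta):
    """Unnormalized attachment weights for nodes 1..t (the root is deg[0])."""
    weights = [(beta * deg[0]) ** alpha_1]
    weights += [float(d) ** alpha_c for d in deg[1:]]
    return weights


def grow_cascade(n_nodes, alpha_1, alpha_c, beta, rng=None):
    """Sample a parent vector [pi_1, ..., pi_{n_nodes - 1}] from the biased PA model."""
    if n_nodes < 2:
        return []                   # a lone root has an empty parent vector
    rng = rng or random.Random()
    parents = [1]                   # pi_1 = 1: node 2 always replies to the root
    deg = [1, 1]                    # degrees of nodes 1 and 2 after that first reply
    while len(deg) < n_nodes:
        weights = attachment_weights(deg, alpha_1, alpha_c, beta)
        k = rng.choices(range(1, len(deg) + 1), weights=weights)[0]
        parents.append(k)
        deg[k - 1] += 1             # the chosen parent gains one link
        deg.append(1)               # the new node starts with its link to the parent
    return parents


# Parameter values are the Slashdot estimates reported in Table 2.
print(grow_cascade(20, alpha_1=0.734, alpha_c=0.683, beta=1.302, rng=random.Random(0)))
```

Setting `alpha_1 = alpha_c` and `beta = 1` in this sketch gives the model without bias.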

Note that, although we explicitly model the event which
triggers the cascade as a root node, this representation does
not restrict the cascade to originate from a single individual
event. The root node can equally represent a group
of initiators.

2.1 Maximum likelihood parameter estimation 

Usually, PA in evolving networks is measured by calculating
the rate at which groups of nodes with identical connectivity
form new links during a small time interval $\Delta t$ [13, 14].
However, this approach is suitable only for networks with
many nodes that are stationary in the sense that the number
of nodes remains constant during the interval $\Delta t$. This
is not a reasonable assumption for our data, which are often
produced by a transient, highly nonstationary response.

Another approach for parameter estimation relies on fitting
a measured property, for instance the degree distribution,
for which an analytical form can be derived in the
model under consideration. For the PA model, extensive
results exist with emphasis precisely on the degree
distributions [3]. However, two important aspects are worth
mentioning here. First, analytical results usually rely on
assumptions like a continuum limit or an infinite-size network,
which do not hold for our data. Second, when parameters are
learned for a particular observable for which an analytical
form has been derived, the model may overfit on this measure,
introducing a bias in other structural properties such as subtree
sizes, average depths, or other correlations.

Our approach considers the likelihood function corresponding
to the entire generative process introduced before (instead
of particular measures such as degree distributions or subtree
sizes). We can assign to each observation (each node
arrival in each cascade) a given probability using Equation
(2). The parameters for which the likelihood is maximal are
the ones that best explain the data given the model assumptions
(see [25] for a similar approach for another network
growth model).

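Formally, we observe a set $\Pi := \{\pi_1, \dots, \pi_N\}$ of $N$ trees
with respective sizes $|\pi_i|$, $i \in \{1, \dots, N\}$, and we want to find
the values of $\theta := (\alpha_1, \alpha_c, \beta)$ which best explain the data.
The likelihood function can be written as:

$$
\mathcal{L}(\Pi \mid \theta) = \prod_{i=1}^{N} p(\pi_i \mid \theta)
= \prod_{i=1}^{N} \prod_{t=2}^{|\pi_i|} p(\pi_{t,i} \mid \pi_{(1:t-1),i}, \theta)
= \prod_{i=1}^{N} \prod_{t=2}^{|\pi_i|} (\beta_x d_{x,t,i})^{\alpha_x}
\left( \sum_{l=1}^{t} (\beta_l d_{l,t,i})^{\alpha_l} \right)^{-1},
\qquad (4)
$$

where $\pi_{(1:t-1),i}$ is the vector of parents in the tree $i$ after
time $t-1$, $x := \pi_{t,i}$ is the parent of node $t+1$ in the tree
$i$, and $d_{x,t,i} := d_{x,t}(\pi_{(1:t-1),i})$ denotes the degree of node $x$
as in Equation (1) in the tree $i$. Instead of maximizing $\mathcal{L}$
directly, it is more convenient to minimize the negative of
the log-likelihood function:

$$
\log \mathcal{L}(\Pi \mid \theta) = \sum_{i=1}^{N} \sum_{t=2}^{|\pi_i|}
\alpha_x \left( \log \beta_x + \log d_{x,t,i} \right) - \log Z_{t,i}(\pi_i \mid \theta),
\qquad (5)
$$

where $Z_{t,i}(\pi_i \mid \theta) = \sum_{l=1}^{t} (\beta_l d_{l,t,i})^{\alpha_l}$.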

For more details about the optimization see the Appendix. 
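As an illustration, the negative log-likelihood of the model with bias can be evaluated as in the following Python sketch. It assumes each attachment has probability $(\beta_x d_x)^{\alpha_x} / Z_t$ with $Z_t$ summing the same weights over all existing nodes, starts at $t = 2$ because $\pi_1 = 1$ is deterministic, and requires $\beta > 0$. The function name and the list-of-parent-vectors input are illustrative choices, and the sketch does not reproduce the optimization described in the Appendix.

```python
import math


def neg_log_likelihood(trees, alpha_1, alpha_c, beta):
    """Negative log-likelihood of a list of parent vectors under the biased PA model."""
    nll = 0.0
    for parents in trees:
        deg = [1, 1]                             # degrees after pi_1 = 1
        for x in parents[1:]:                    # observed parents pi_2, pi_3, ...
            # Log-weights alpha_k * (log beta_k + log d_k), with beta_k = 1 off the root.
            log_w = [alpha_1 * (math.log(beta) + math.log(deg[0]))]
            log_w += [alpha_c * math.log(d) for d in deg[1:]]
            peak = max(log_w)                    # log-sum-exp for numerical stability
            log_z = peak + math.log(sum(math.exp(v - peak) for v in log_w))
            nll -= log_w[x - 1] - log_z
            deg[x - 1] += 1                      # update degrees for the next step
            deg.append(1)
    return nll


# With both exponents at zero every existing node is equally likely,
# so this three-step tree contributes probabilities 1/2 and 1/3.
print(neg_log_likelihood([[1, 1, 2]], 0.0, 0.0, 1.0), math.log(6))
```

A function of this form could be handed to a general-purpose numerical optimizer to search over the three parameters; that choice is an assumption of this sketch.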

3. DATASETS 

We have analyzed the discussion cascades of four websites. 
In the following paragraphs we give a more detailed descrip- 
tion of the datasets and the corresponding websites. Global 
descriptive statistics can be found in Table 1.

Slashdot (SL): Slashdot is a popular technology-news
website created in 1997 that frequently publishes short news
posts and allows its readers to comment on them. Slashdot
has a community-based moderation system that awards a
score to every comment and upholds the quality of discussions.
The comments can be nested, which allows us to extract
the tree structure of the discussion. A single news post
typically triggers about 200 comments (most of them in a
few hours) during the approximately two weeks it is open for
discussion. Our dataset contains all discussions
generated at Slashdot during a year (from August 2005 to
August 2006). See [8] for more details about this dataset.

Barrapunto (BP): Barrapunto is a Spanish version
of Slashdot created in 1999. It runs the same open source
software as Slashdot, making the visual and functional
appearance of the two sites very similar. They differ in the
language they use and the content of the news stories
displayed, which normally does not overlap. The volume of
activity on Barrapunto is significantly lower. A news story on
Barrapunto triggers on average around 50 comments. Our
dataset contains the activity on Barrapunto during three
years (from January 2005 to December 2008).

Table 1: Dataset statistics for Slashdot (SL), Barrapunto (BP), Meneame (MN) and Wikipedia (WK).

dataset   #cascades   #nodes (comments)   max. nodes   max. users   total users   repeated user
SL            9,820           2,028,518        1,567        1,031        93,638   > 1: 99%
BP            7,485             357,951          841          180         6,864   > 1: 85%
MN           58,613           2,220,714        2,718        1,021        53,877   > 1: 34%; > 5: 70%
WK          871,485           9,421,976       32,664        5,969       350,958   > 1: 34%; > 5: 96%

Meneame (MN): Meneame is the most successful Spanish
news aggregator. The website is based on the idea of
promoting user-submitted links to news (stories) according to
user votes. It was launched in December 2005 as a Spanish
equivalent to Digg. The entry page of Meneame consists
of a sequence of stories recently promoted to the front page,
as well as links to pages containing the most popular and
newly submitted stories. Registered users can, among other
things: (a) publish links to relevant news, which are retained
in a queue until they collect a sufficient number of votes to
be promoted to the front page of Meneame, (b) comment
on links sent by other users (or themselves), (c) vote on
(menear) comments and links published by other users. Contrary
to both BP and SL, Meneame lacks an interface for nested
comments. Comments are displayed as a list, so that the
tree structure is hidden. However, the tag #n can be used to
indicate a reply to the n-th comment in the comment list,
and we use it to extract the tree structures we analyze in this study.
To focus on the most representative cascades, we filter out
stories that were not promoted, that is, those marked as discarded,
abuse, etc. Our dataset contains the promoted stories and
corresponding comments during the interval between December
2005 and July 2009.

Wikipedia (WK): The English Wikipedia is the largest
language version of Wikipedia. Every article in Wikipedia
has its corresponding article talk page where users can
discuss how to improve the article. For our analysis we used a
dump of the English Wikipedia of March 2010 which
contained data of about 3.2 million articles, out of which about
870,000 articles had a corresponding discussion page with at
least one comment. In total these article discussion pages
contained about 9.4 million signed comments. Note that the
comments are never deleted, so this number reflects the
totality of comments ever made about the articles in the dump.
The oldest comments date back to as early as 2001. Comments
that are considered a reply to a previous comment
are indented, which allows us to extract the tree structure of
the discussions. Note that Wikipedia discussion pages contain,
in addition to comments, structural elements such as
subpages, headlines, etc. which help to organize large
discussions. We eliminate all these elements and concentrate
our analysis on the remaining pure discussion trees. More
details about the dataset and the corresponding data
preparation and cleaning process can be found in [19]. For our
experiments we selected a random subset of 50,000 articles
from the entire dataset. Results did not vary significantly
when using different random subsets of the data.

3.1 Global analysis

In this section we give a brief overview of some general
characteristics of the four datasets. Several indicators
are shown in Table 1. As columns 4 and 5 show, the biggest
observed discussions can be composed of hundreds of comments
and propagate across hundreds of users. We find the
biggest discussion in Wikipedia, involving 5,969 users and
32,664 comments. In Barrapunto, however, the biggest
conversation comprised 180 users and 841 comments.

It is interesting to consider this quantity relative to the
size of the underlying social network (compare columns 5
and 6, where we indicate the total number of users during the
crawled period). We see a remarkable fact: the percentage
of users affected by the largest cascade is very small. In
particular, it ranges from 1.1% for Slashdot to 2.6% for
Barrapunto, the dataset with the smallest largest cascade
in absolute terms. Globally, these results
show that even the largest cascades only affect a very small
portion of the entire underlying social network.
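These percentages follow directly from the "max. users" and "total users" columns of Table 1, as the short Python check below shows. The dictionary layout and the function name `share_percent` are illustrative; the figures are the ones reported in the table.

```python
# (max. users in a single cascade, total users) per dataset, from Table 1.
table_1 = {
    "SL": (1031, 93638),
    "BP": (180, 6864),
    "MN": (1021, 53877),
    "WK": (5969, 350958),
}


def share_percent(max_users, total_users):
    """Percentage of all users who took part in the largest cascade."""
    return 100.0 * max_users / total_users


for name, (max_users, total_users) in table_1.items():
    print(f"{name}: {share_percent(max_users, total_users):.1f}%")
```

Slashdot and Barrapunto give the two extremes, and the values for Meneame and Wikipedia fall between them.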

A characteristic feature of discussion cascades is the high 
frequency of user participation. Evidence of this is provided 
in column 7, where we show the percentage of cascades in 
which at least one user is involved more than once for cas- 
cades with more than two nodes (for MN and WK, we also 
show the percentage for cascades with more than five nodes). 
With the exception of Meneame, all datasets show very high 
values. In Slashdot, practically all posts contained at least 
one user who commented more than once (considering only 
registered users). An important consequence of this fact is 
that information diffusion may not be properly explained 
using epidemic models such as SIR (susceptible-infected- 
recovered) models, unlike in other scenarios like photo
popularity [5] or fanning pages [24].

Figure 3 shows the distribution of the cascade sizes of
the four datasets. As expected, all distributions are positively
skewed, showing a high concentration of relatively
short cascades and a long tail with large cascades. However,
although all distributions are heavy tailed, we clearly see
a different pattern between the three news aggregators and
Wikipedia. Whereas SL, BP and MN present a distribution
with a defined scale, the distribution of cascade sizes of
Wikipedia is closer to a scale-free distribution, in line with
the cascades found in weblogs [20] and USENET [16]. We
remark that, even in the Wikipedia case, the power-law
hypothesis for the tail of this distribution is not supported by
a rigorous statistical test: we obtain an exponent of 2.17 at the
cost of discarding 97% of the data.

Figure 3: Cascade sizes for the different datasets

Figure 4: Examples of synthetic discussion cascades.

We also observe a progressive deviation from websites with 
a well defined scale such as Slashdot, which could be de- 
scribed using a log-normal probability distribution, towards 
websites with less defined scale such as Meneame, which may 
show a power-law behavior for cascade sizes > 50. Barra- 
punto falls in the middle and, interestingly, is more similar 
to Meneame than to Slashdot. 

The previous considerations imply that, in general, a new 
post in Slashdot can hardly stay unnoticed and will propa- 
gate almost surely over several users. Conversely, most of 
the news in Meneame will only provoke a small reaction and 
reach, if they do, a small group of users. Compared with 
Wikipedia, we can say that Meneame is the news aggrega- 
tor which has most similarities with it. 

Figure 1 illustrates the different types of cascades which
we found. We plot representative cascades with similar sizes
selected randomly from each of the four datasets. For Slashdot
we can see that the chain reaction is located mainly on
the initiator event (direct reactions), but some nodes also
have high degree, resulting in bursty disseminations. We
could say that after a news article is posted, the collective
attention is constantly drifting from the main post to some
new comments which become more popular. In Barrapunto
we observe similar structures, although their persistence is
less noticeable. In contrast, Meneame is characterized
by a high concentration of nodes at the first level together
with rare but long chains of thin threads. This represents
a pattern where only a few comments receive multiple
replies, but where a comment can sporadically trigger a long
dialog between a few users. We note that this phenomenon might be
caused by the fact that the cascade tree and, more importantly,
the number of replies a comment receives are hidden
in the interface of Meneame. Finally, the case of Wikipedia
is very similar to Meneame, but with even longer, more
frequent and finer threads of nodes with very low degree.

4. RESULTS 

In this section we validate the proposed model by compar- 
ing the real cascades to the ones generated using the model. 

4.1 Model validation description 

We use the cascades from the four datasets to validate 
the proposed PA model with bias. The parameters are 
optimized for each dataset independently using the entire 
dataset and we generate the same number of synthetic cas- 
cades as the number of real cascades extracted from each 
dataset. An alternative validation would be to use a train- 
test paradigm on each dataset independently to prevent over- 
fitting. For simplicity, and since the goal of this study is to 
characterize the different datasets instead of minimizing the 
generalization error of new threads sampled from the model, 
we prefer to use the entire datasets for learning. 

The size of each synthetic cascade is predetermined by drawing
a pseudo-random number from the empirical distribution
of cascade sizes (see Figure 3). We calculate the following
quantities from the empirical data and from the synthetic
cascades produced by the model:

Root node degree probability distribution: Each cascade
has a root degree, which is the number of direct
contributions to the root.

Total degree distribution: We consider the degree probability
distribution of any node, without differentiating
root versus non-root nodes.

Subtree sizes distribution: For each non-root node, we
compute the probability distribution of the total number
of its descendants.

Mean node depth: Each non-root node belongs to one
level of the cascade. We compute the mean over all
the levels of all the nodes.

Size - Proportion of direct reactions: We compute the
relation between the size of a cascade and the proportion
of direct reactions to the root and analyze if they
are correlated.

Footnote: The estimated parameter values did not vary significantly
using different, sufficiently large random subsets of the data,
as a cross-validation (train-test) procedure would also have shown.

Figure 5: Model validation for Slashdot

Figure 6: Model validation for Barrapunto
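The quantities used for validation can all be read off the parent vector of a single cascade, as in this Python sketch. It is illustrative only: the function name and dictionary keys are hypothetical, and the proportion of direct reactions is assumed here to be the root degree divided by the number of comments.

```python
from collections import Counter


def cascade_statistics(parents):
    """Per-cascade quantities from a parent vector [pi_1, ..., pi_{n-1}]."""
    if not parents:
        raise ValueError("the cascade needs at least one comment")
    n = len(parents) + 1                        # nodes are labeled 1..n, node 1 is the root
    replies = Counter(parents)                  # replies[k] = number of direct replies to k
    depth = [0] * (n + 1)                       # index 0 unused; the root has depth 0
    descendants = [0] * (n + 1)
    for node in range(2, n + 1):                # parents always precede their children
        depth[node] = depth[parents[node - 2]] + 1
    for node in range(n, 1, -1):                # accumulate subtree sizes bottom-up
        descendants[parents[node - 2]] += descendants[node] + 1
    return {
        "root_degree": replies[1],
        "degrees": [replies[1]] + [1 + replies[k] for k in range(2, n + 1)],
        "subtree_sizes": descendants[2:],       # descendants of each non-root node
        "mean_depth": sum(depth[2:]) / (n - 1), # mean level of the non-root nodes
        "direct_fraction": replies[1] / (n - 1),
    }


# Hypothetical six-node cascade, not taken from the datasets.
print(cascade_statistics([1, 1, 2, 1, 3]))
```

Pooling these per-cascade values over all real or synthetic cascades of a dataset gives the empirical distributions that are compared.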

4.2 Structure of the cascades 

Figures 5, 6, 7 and 8 show plots of the previous quantities
for each dataset and the outcomes of both PA models, with
and without bias to the root.

Overall, the model with bias is able to capture reasonably 
well all the measured properties, except the mean depth. In 
particular the degree distributions of the root nodes are very 
accurately reproduced, even though each dataset exhibits a 
different profile (see top-left plot of the figures). For this
quantity, the effect of including a bias term
is clearly visible. A model without bias systematically
produces degree distributions that are too skewed for the non-root
nodes and have too short tails for the root nodes, and it cannot
capture qualitatively the shape of the total degree
distribution (see top-right plots of the figures).

A similar behavior is observed in the correlations between 
the log-size of the cascade and the proportion of direct re- 
actions (bottom plots of the figures). Although the scatter 
plots differ substantially across datasets, the model with bias
is able to reproduce them qualitatively, which is not the case 
for the model without bias (data not shown). 

The model with bias also generates correct subtree sizes
in general, with the exception of Meneame, which we postulate
is caused by the particularities of the platform (see
Section 3 for details). In contrast, the model without
bias systematically produces longer tails than the real ones.

Both models tend to produce shorter tails for the mean 
depth distribution in all datasets. This seems to be a cur- 
rent limitation of the model. Although for Slashdot and 
Barrapunto this deviation is not very severe, for the other 
two datasets we observe clear discrepancies at the tail of the 
distributions. Notice that in this case, the model without 
bias is unable even to reproduce the probability mass cor- 
responding to the first values of the distribution. We will
return to this point in Section 6.

To conclude this section, we show in Figure 4 the synthetic
counterpart of Figure 1, where we plot representative
cascades with similar sizes selected randomly from each of 
the four synthetic datasets. We can see that the generated 
cascades present a strong resemblance with the real ones. 

4.3 Evolution of the cascades 

After having compared the main structural properties of 
the synthetic trees with the real ones, we now investigate 
whether the PA model with bias is also able to reproduce 
the growth process of the cascades. In other words, if we take
intermediate snapshots of the cascades during their evolution,
how closely do the synthetic trees match their real counterparts?

To this end we record two quantities: the width (the maximum
number of nodes at any level) and the mean
depth of the trees every time a new node is added (at every
time-step). Note that the time-steps in the model do
not coincide with the actual time differences between the
comments. They just reflect the sequence of the comments
attaching to the cascade. In reality, information spreading
is conditioned by the large heterogeneity present in human
activity, for instance induced by circadian cycles, which
results in information transmission speeds governed by
subexponential distributions such as log-normals or power laws
[12, 14, 22]. Capturing the growth process of the real cascades
is therefore a challenging task for our model.

Figure 7: Model validation for Meneame

Figure 8: Model validation for Wikipedia
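Both quantities can be tracked incrementally while a cascade grows, as the following Python sketch illustrates. It assumes the width is the largest number of nodes found on any level below the root and that the mean depth is taken over non-root nodes; the function name is hypothetical.

```python
def width_and_depth_evolution(parents):
    """Return [(width, mean depth)] after each node arrival in a parent vector."""
    depth = [0, 0]                  # depth[k] for node k; index 0 unused, root at depth 0
    per_level = {}                  # per_level[d] = number of nodes at depth d
    width = 0
    depth_sum = 0
    history = []
    for parent in parents:          # one time-step per arriving comment
        d = depth[parent] + 1
        depth.append(d)
        per_level[d] = per_level.get(d, 0) + 1
        width = max(width, per_level[d])
        depth_sum += d
        history.append((width, depth_sum / (len(depth) - 2)))
    return history


# Hypothetical six-node cascade, not taken from the datasets.
for step, (w, mean_d) in enumerate(width_and_depth_evolution([1, 1, 2, 1, 3]), start=1):
    print(step, w, round(mean_d, 3))
```

Averaging these curves over all cascades of a dataset, real or synthetic, yields evolution curves of the kind compared in this section.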

The average overall width and depth evolution curves are
presented in Figures 9 and 10 for all datasets, comparing the
original cascades (continuous lines with symbols) with the
biased model (dashed lines). We observe a nearly perfect
agreement between our model and the data in the evolution
of the width of the discussions (Figure 9) for three of
the four datasets. Only in the case of Slashdot does the model
underestimate the width of the tree, although it still reproduces
the same curve shape if normalized by the final depth.

The picture in the case of the mean depth (Figure 10)
is less favorable, but still shows a reasonable agreement of
our model with the data. In the case of Wikipedia, although
the model underestimates the mean depth, it reproduces a
rescaled version of it. The other datasets show a similar
profile. Initially, the synthetic trees are too deep and the
mean depth is overestimated. This deviation is corrected at
some point and then the opposite effect takes place: when
the depth of the synthetic trees saturates, the depth of the
real ones still grows. The initial deviation is especially severe
in Slashdot, for which remarkably the final mean depth is
very close to that of the real cascades. A possible way to
overcome this problem is discussed in Section 6.

Figure 9: Evolution of the width

4.4 Interpretation of parameters 

Can we derive conclusions about the communication habits
which characterize each website based on the parameters
which best fit each model? Figure 11(a) shows the
optimal parameter values for each dataset in a three-dimensional
plot, where the horizontal and vertical axes correspond
to $\alpha_1$ and $\alpha_c$ respectively and the size of the marker to the
value of $\beta$. Table 2 shows the same values numerically.

The role of the exponents $\alpha_1$ and $\alpha_c$ in the model is to
quantify the degree of preferential attachment of the root
node and the non-root nodes respectively. The higher their
values, the more relevant popularity is in determining the
attractiveness to new nodes in the cascade. For instance,
values very close to zero imply a random cascade where new
nodes are linked to existing ones with uniform probability.
We can use the established theoretical results described in
Section 2 to characterize the websites under study.

Figure 10: Evolution of the mean depth

Table 2: Optimal parameters

dataset   $\alpha_1$   $\alpha_c$   $\beta$
SL        0.734         0.683       1.302
BP        0.665        -0.116       0.781
MN        0.856         0.196       1.588
WK        0.884        -1.684       0.794

The different exponents $\alpha_1$ are all sublinear (< 1),
relatively high, and very similar in all datasets, indicating a
strong PA in the root process of all cascades. The two
lowest values for this quantity are observed for Barrapunto
and Slashdot. On the other hand, Meneame and Wikipedia
present a higher and almost identical value, suggesting a
very similar role of the root nodes in the PA mechanism of
both websites.

A clear segregation between the group of three news media 
websites and the Wikipedia is manifested on Figure [11(a) in 
the value of Oc. Slashdot, has the highest value, Qc ~ 0.68, 
even higher than ai for Barrapunto. It is also very similar 
to ai for the same dataset. This similarity may capture the 
special quality of the Slashdot comments. In a sense, good 
comments may behave like posts and may act eventually as 
effective initiator of information diffusion cascades. 
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Figure 11: Comparison of parameter values for (a) 
the different datasets and (b) the topics of Slashdot. 
Marker sizes encode $\beta$ differently in (a) and (b).



The smaller value of $\alpha_c \approx 0.2$ found for Meneame indicates
that the diffusion of the news comments on this website is
closer to a random process. This may again be influenced
by the lack of explicit information about the popularity of a
given comment. The same is true for Barrapunto, although
its value of $\alpha_c \approx -0.1$ is slightly negative, indicating
a slight inverse PA process.

However, such an inverse PA process is much more prominent
in the case of Wikipedia: whereas its $\alpha_1$ is high,
indicating a strong PA in the root process in agreement with
the other datasets, its $\alpha_c$ is negative and has the largest
absolute value of all four datasets. What are the implications
of this result? Once a comment on a Wikipedia article
has been posted, it tends to develop into a reciprocal
chain among a very small group of contributors.
Once a node has received a reply, it is about three (exactly:
$2^{1.684} \approx 3.2$) times less likely to receive another one
than the new replying comment itself. In other words, nodes
with degree equal to one (leaf nodes) are much more likely to
be linked with a new node than nodes with higher degrees.
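
The factor quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming that the attachment weight of a non-root node is proportional to $d^{\alpha_c}$, where the degree $d$ counts the link to the parent (so a fresh comment has degree 1 and a comment with one reply has degree 2). The helper name `relative_weight` is introduced here for illustration only; the exponents are those of Table 2.

```python
# Fitted non-root exponents alpha_c, copied from Table 2.
ALPHA_C = {"SL": 0.683, "BP": -0.116, "MN": 0.196, "WK": -1.684}

def relative_weight(degree_a, degree_b, alpha):
    """Ratio of attachment weights d_a**alpha / d_b**alpha (illustrative helper)."""
    return (degree_a / degree_b) ** alpha

for site, alpha in ALPHA_C.items():
    # Compare a comment that already has one reply (degree 2)
    # with a fresh leaf comment (degree 1).
    ratio = relative_weight(2, 1, alpha)
    print(f"{site}: replied-to comment is {ratio:.2f}x as attractive as a leaf")
```

A ratio below one corresponds to the inverse PA behavior discussed above; its reciprocal for Wikipedia is the "three times less likely" factor.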

Finally, the parameter $\beta$ acts as a weight which expresses
the trend towards the root of the cascade in relation to the
subsequent nodes. It is especially important at the beginning
of the cascade, when the degree of the root node is low,
and determines whether initially many nodes attach to the
root or rather to one of the first comments. We observe
that Meneame shows the largest initial predominance
of direct reactions, while Wikipedia gives higher probability
mass to the comments, thus allowing long chains already
with a small number of nodes. The values for Slashdot and
Barrapunto lie in between, indicating an intermediate initial
preference for the root node, with Barrapunto showing a higher
probability of early reply chains than Slashdot.
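
To make the interplay of the three parameters concrete, the growth process can be simulated. This is a minimal sketch under stated assumptions, not the authors' implementation: the root is taken to have weight $(\beta d)^{\alpha_1}$ and every other node weight $d^{\alpha_c}$, all nodes (including the root) start with degree 1, and the names `simulate_cascade`, `mean_depth` and `PARAMS` are hypothetical. The parameter triples are those of Table 2 for Meneame and Wikipedia.

```python
import random

def simulate_cascade(n_comments, alpha_root, alpha_c, beta, seed=None):
    """Grow one cascade; parents[j] is the parent of comment j + 1 (0 is the root)."""
    rng = random.Random(seed)
    degrees = [1]  # assumption: the root starts with degree 1
    parents = []
    for _ in range(n_comments):
        # Assumed weights: (beta * d)^alpha_1 for the root, d^alpha_c for comments.
        weights = [(beta * degrees[0]) ** alpha_root]
        weights += [d ** alpha_c for d in degrees[1:]]
        parent = rng.choices(range(len(degrees)), weights=weights)[0]
        parents.append(parent)
        degrees[parent] += 1
        degrees.append(1)  # a new comment has degree 1 (its link to the parent)
    return parents

def mean_depth(parents):
    """Average distance to the root over all comments of one cascade."""
    depth = [0]
    for parent in parents:
        depth.append(depth[parent] + 1)
    return sum(depth[1:]) / len(parents)

# (alpha_1, alpha_c, beta) as reported in Table 2.
PARAMS = {"MN": (0.856, 0.196, 1.588), "WK": (0.884, -1.684, 0.794)}

for site, (a1, ac, b) in PARAMS.items():
    depths = [mean_depth(simulate_cascade(50, a1, ac, b, seed=s)) for s in range(200)]
    print(site, round(sum(depths) / len(depths), 2))
```

Comparing the printed mean depths for the two parameter sets gives a rough feel for how a negative $\alpha_c$ and a smaller $\beta$ shift probability mass from the root towards reply chains.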

What would be the scenario if one tried to explain the
cascades using the simple model without bias to the root?
We also fit the simple PA model to the data. In that case, we
would mistakenly infer that both the Meneame and Wikipedia
systems are in the "condensation" regime [18], since their
exponents (1.360 and 1.161, respectively) are larger than one.

The next question we want to answer is how stable these
parameters are within the same site: if we split the
discussions, for example according to their category, do we
obtain similar parameters? We investigate this question for
the topic categories in Slashdot (we only consider categories
with more than 100 discussions) and find (see Figure 11(b))
that the majority of topics have a set of parameters
close to the one obtained for the entire website.

This is remarkable given the heterogeneous picture that
is observed if the depths and widths of the discussions are
considered [9]. So it seems that, although the number of
comments that the different categories attract may differ,
the actual structure of the discussions follows a
very similar pattern. However, if we consider the $\alpha$
parameters, we observe three outliers from the main cluster. The
topics "books" and "ask" have much larger values, indicating
a more pronounced preferential attachment behavior, while
the topic "games", on the other hand, has the largest
difference between $\alpha_1$ and $\alpha_c$. It seems that in this
topic category direct comments to the root node are more
frequent, while in the two other outliers comments also seem
to be able to attract a reasonable number of replies. It is
also interesting



Footnote: The "games" category is also the one with the shallowest trees [9].



to observe the differences in the values of $\beta$, where we find
the largest trend toward the root for "hardware" and the
smallest for "books" articles.

Summarizing, the optimal parameters permit an interpretation
of the communication habits of each social space and
are relatively stable across different categories within a site.
This representation also leads to different groupings of the
websites depending on the parameter considered. The bias to
the root node is crucial to separate Slashdot and Barrapunto
from Wikipedia and Meneame according to $\alpha_1$, and Wikipedia
from the three news aggregators according to $\alpha_c$.

5. RELATED WORK 

Due to the increasing availability of empirical data on
cascades, an extensive body of work is appearing on how
information cascades propagate in a social network.

At the level of statistical description, information cascades
have been analyzed in detail for particular social spaces.
Twitter cascades [17] are predominantly shallow and wide
(maximum depth is 11). Flickr [6] shows the remarkable
phenomenon that popular photos spread slowly and not widely.
This is in harmony with our findings, which show that even
the largest realizations reach a very small proportion of the
social network.

Blog cascades have been analyzed in [20]. Interestingly,
although one would expect blog cascades to share more
similarities with the discussion cascades existing in Slashdot
or Meneame, it is the Wikipedia dataset which shows the most
similar patterns to the blog cascades (see Figure [S]). In [10],
a model of both bloggers (users) and cascades was presented
which reproduces global temporal and structural aspects of
the blogosphere. We note that the motivation of our work
is rather different. Whereas [20, 10] aim to find the
simplest, parameter-free model able to describe both user
network and cascade behavior, we look for a parameterized
model from which we can describe the communication habits
that characterize a particular website (see Section 4.4). In
contrast to the blog data, the datasets considered here
contain complete information on the cascade evolution. In this
sense, our data avoids the selection bias which strongly
influences the estimation of these processes [7]. In [7], a simple
branching process (Galton-Watson process) is proposed for
modeling chain-letter cascades. Although such a model may
explain certain characteristics such as depth distributions
(after proper correction for selection bias), it cannot capture
the cascade evolution and assumes that all degree
distributions are independent, so its utility for our purposes
remains very limited.

During the development of this manuscript we learned of
the work of Kumar et al. [16], which also presents a model
for discussion trees, called T-MODEL. The same study also
considers other aspects of the cascades, such as the identities
of each member of the conversation. Our work is focused on
the cascade model, its parameter estimation and its validation
on the four datasets. The same authorship model as that of
[16], or a different one, could also be built on top of the
model proposed here.

The T-MODEL is based on linear preferential attachment
only and, unlike ours, does not distinguish between
root and subsequent nodes. However, it includes a recency
term which allows it to capture qualitatively the relation
between the sizes and the depths of the cascades. Preliminary
experiments indicate a similar ability of our model in
this respect. Additionally, the bias to the root considered
here allows other quantities, such as the degree distributions
or the subtree sizes, to be captured with higher accuracy.
This suggests that, at least for the datasets analyzed
here, our model performs better. The maximum likelihood
estimation scheme presented here finds the best parameters
of a model given the data, and therefore allows the predictive
power of different models to be quantified objectively. Such
a comparison between the two cascade models and possible
hybrid forms is left for future research.

A further difference is that the T-MODEL also includes a
parameter for the termination of the discussion. The
resulting termination probability of a discussion is
independent of its actual structure and could
be substituted by any other model encoding the popularity
of discussions (e.g. [26]), which could also be combined with
our model.

6. DISCUSSION 

We have presented a detailed analysis and comparison
of the structure and evolution of the discussion cascades
of three popular news media websites and the English
Wikipedia. Our analysis highlights the heterogeneities
between the discussion cascades, which may be conditioned
by two factors, namely the page design (or platform) and
the audience. Despite this, we have given evidence that
a simple model can capture most of the structural properties
and the evolution profiles of the real cascades, including the
particularities of each dataset. Further, we have derived a
rigorous maximum likelihood approach which considers the
entire evolution of the cascade. The learned parameters of
the model proposed here allow for a figurative description
that characterizes the communication habits of a website.

For some datasets, the model tends to produce cascades that
are too shallow. We postulate that this occurs especially in
mature discussions, where interaction at the leaves only
happens between a few individuals who start to reply to each
other and increase the mean cascade depth considerably. A
possible extension which could correct for that effect is the
focus of current research.
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APPENDIX 
Log-likelihood function 

In this appendix we describe some considerations related to
the log-likelihood function (5) we want to minimize. Briefly,
we show that the PA model can be formulated as a probability
distribution which belongs to the exponential family.
Consequently, the optimization problem is convex, i.e. it has
the convenient property that any local minimum is global.

Without loss of generality we can assume that the parameter
$\beta_k$ is of the following form:

$$\beta_k^{\alpha_k} := \exp(\beta'_k).$$



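We can rewrite the PA model defined in Equation (2) as:

$$p(\pi_t = k \mid \pi_{(1:t-1)}) = \frac{1}{Z_t} \exp\left(\beta'_k + \alpha_k \log d_{k,t}\right),$$

where $Z_t = \sum_{l=1}^{t} \exp\left(\beta'_l + \alpha_l \log d_{l,t}\right)$.
This probability distribution is equivalent to that of
Equation (2), but expressed in terms of the exponential family.
The log-likelihood function (5) can be rewritten as:

$$\log \mathcal{L}(\Pi \mid \theta) = \sum_{i=1}^{N} \sum_{t=2}^{|\pi_i|} \left[ \beta'_{x} + \alpha_{x} \log d_{x,t,i} - \log Z_{t,i}(\pi_i \mid \theta) \right], \qquad x = \pi_{t,i},$$

where $Z_{t,i}(\pi_i \mid \theta) = \sum_{l=1}^{t} \exp\left(\beta'_l + \alpha_l \log d_{l,t,i}\right)$.
Since each $\log Z_{t,i}$ is a log-sum-exp of functions that are
linear in the parameters, the Hessian (matrix of second-order
partial derivatives) of the negative log-likelihood is always
positive semi-definite.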
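
The convexity claim can be checked directly. The following is a short sketch for a single time step, writing $\theta = (\beta', \alpha_1, \alpha_c)$ for the parameters and $\phi_l$ for the vector of sufficient statistics of node $l$. Both symbols are introduced here only for illustration, with $\phi_1 = (1, \log d_{1,t}, 0)$ for the root and $\phi_l = (0, 0, \log d_{l,t})$ for any other node, and the sums run over the nodes present at step $t$.

```latex
\begin{align*}
-\log p(\pi_t = k \mid \pi_{(1:t-1)})
  &= \log Z_t(\theta) - \theta^\top \phi_k,
  \qquad Z_t(\theta) = \sum_{l} \exp\big(\theta^\top \phi_l\big), \\
\nabla_\theta^2 \log Z_t(\theta)
  &= \sum_{l} p_l\, \phi_l \phi_l^\top
   - \Big(\sum_{l} p_l\, \phi_l\Big)\Big(\sum_{l} p_l\, \phi_l\Big)^{\top}
   = \operatorname{Cov}_{p}[\phi] \succeq 0,
  \qquad p_l = \frac{\exp(\theta^\top \phi_l)}{Z_t(\theta)}.
\end{align*}
```

The linear term $\theta^\top \phi_k$ does not contribute to the second derivatives, so the negative log-likelihood is a sum of convex terms and is therefore itself convex in $\theta$.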

The presented method can therefore be applied to any
set of observations which can be expressed as a collection
of parent vectors $\Pi$ from which the degrees of each node at
each time-step can be obtained. Once the minimization is
performed, we can recover the original parameter $\beta_k$ with
$\beta_k = \exp(\beta'_k / \alpha_k)$. The basic PA model is the
special case where $\alpha = \alpha_1 = \alpha_c$.
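
As an illustration of how the objective can be evaluated from parent vectors alone, the following is a minimal sketch in Python, not the authors' Matlab code. It assumes that nodes are indexed from 0 (the root), that every node starts with degree 1, and that only the root carries the $\beta'$ term; the function name `neg_log_likelihood` and these conventions are assumptions of the sketch.

```python
import math

def neg_log_likelihood(parent_vectors, beta_prime, alpha_root, alpha_c):
    """Negative log-likelihood of a collection of cascades.

    Each parent vector lists, for comments 1, 2, ..., the index of the node
    that was replied to (0 is the root).
    """
    total = 0.0
    for parents in parent_vectors:
        degrees = [1]  # assumption: the root starts with degree 1
        for parent in parents:
            # Log-weights: beta' + alpha_1 * log d for the root,
            # alpha_c * log d for every other node.
            log_w = [beta_prime + alpha_root * math.log(degrees[0])]
            log_w += [alpha_c * math.log(d) for d in degrees[1:]]
            # Numerically stable log-sum-exp gives log Z_t.
            m = max(log_w)
            log_z = m + math.log(sum(math.exp(w - m) for w in log_w))
            total -= log_w[parent] - log_z
            # Update degrees only after scoring the observed attachment.
            degrees[parent] += 1
            degrees.append(1)
    return total
```

A function of this shape can be passed to any derivative-free optimizer over $(\beta', \alpha_1, \alpha_c)$; the original $\beta$ is then recovered as $\exp(\beta'/\alpha_1)$, following the relation stated above.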

Note that the bias to the root node can be introduced in three ways:

(A) Using two alphas $\alpha_1$, $\alpha_c$ but no $\beta$ ($\beta' = 0$, i.e. $\beta = 1$).

(B) Using one alpha $\alpha = \alpha_1 = \alpha_c$ and $\beta$.

(C) Using two alphas $\alpha_1$, $\alpha_c$ and $\beta$ (the approach
presented in this manuscript).

As expected, since model (C) uses more parameters than
(A) and (B), the resulting likelihoods and fits are better. In
particular, the impact of adding $\beta$ as a parameter is notable
in the approximated measures related to the root node, for
instance the root degree distributions.

Notice that convexity does not imply uniqueness of the
optimal parameter values. It could happen that the same
minimum is attained for a large range of parameter values. As
an optimization procedure we used the Nelder-Mead simplex
algorithm (implemented as fminsearch in Matlab), which is
an unconstrained non-linear direct search method that does
not use numerical or analytic gradients. Starting from many
different random initial conditions, we did not find multiple
optimal values in any of the datasets, which indicates
that the presented values for each dataset are unique.
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